First, let's load the required packages.
library(printr)
library(fpp2)
library(tidyverse)
library(ggthemes)
library(jtools)
theme_set(theme_clean())

# First we load the data from fpp2 and save it as a dataframe with a date column for pleasant aesthetics
data("elecdaily")
elec.df <- as.data.frame(elecdaily)
elec.df$date <- seq(from = as.Date("2014-01-01"), to = as.Date("2014-12-31"), by = 1)
elec.df %>%
ggplot(aes(x = date, y = Demand)) + geom_line(color = 'red') +
labs(title = 'Victoria, Australia Electricity usage during 2014',
x = 'Date',
y = 'Demand')

Daily electricity demand in GW for Victoria, Australia, during 2014
elec.df %>%
ggplot(aes(x = date, y = Temperature)) + geom_line(color = 'blue') +
labs(title = 'Victoria, Australia Temperature during 2014',
x = 'Date',
y = 'Temperature')

Daily temperatures for Melbourne in degrees Celsius
The temperature plot does indeed start in January: Melbourne experienced its maximum temperatures on January 16-17, 2014. At first I doubted that the series began in January, but it does. Let's go on to examine the relationship between daily electricity usage and temperature.
# Let's use the fpp2 tslm() function
lm1 <- tslm(Demand ~ Temperature, data = elecdaily)
summ1 <- summ(lm1, model.info=T)
export_summs(summ1, output='html',
statistics = c("N" = "nobs",
"R squared" = "r.squared",
"P value" = "p.value",
"AIC" = "AIC"),
model.names = c('Electricity Demand'))

|  | Electricity Demand |
|---|---|
| (Intercept) | 212.39 *** |
| (5.01) | |
| Temperature | 0.42 |
| (0.23) | |
| N | 365 |
| R squared | 0.01 |
| P value | 0.07 |
| AIC | 3433.49 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
We can see a positive relationship between temperature and electricity demand. This makes sense: as the temperature increases, we would expect people to cool themselves down with fans or air conditioning, so higher temperatures lead to more electricity use (at lower temperatures, heating demand rises instead). For reference, the average temperature in Melbourne in 2014 was 21.26 degrees Celsius.
The question asks us to use the first 20 days, so from here on we will be using the first 20 days of data. The regression is given below.
daily20 <- head(elecdaily, 20)
daily.df <- as.data.frame(daily20)
daily.df$date <- seq(from = as.Date("2014-01-01"), to = as.Date("2014-01-20"), by = 1)
lm20 <- tslm(Demand ~ Temperature, data = daily20)
summ20 <- summ(lm20, model.info=T)
export_summs(summ20, output='html',
statistics = c("N" = "nobs",
"R squared" = "r.squared",
"P value" = "p.value",
"AIC" = "AIC"),
model.names = c('Electricity Demand'))

|  | Electricity Demand |
|---|---|
| (Intercept) | 39.21 * |
| (17.99) | |
| Temperature | 6.76 *** |
| (0.61) | |
| N | 20 |
| R squared | 0.87 |
| P value | 0.00 |
| AIC | 184.30 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
We get the same positive relationship, but it is now highly significant (p < 0.001).
We can check our assumptions with checkresiduals() from the fpp2 package.
##
## Breusch-Godfrey test for serial correlation of order up to 5
##
## data: Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774
We can see that for the model using the first 20 days there are no outliers. Furthermore, we do not have evidence of autocorrelation. A fit of the regression is plotted below. Note that this would not be the case if we were to use the entire year of daily data.
daily.df %>%
ggplot(aes(x = Temperature, y = Demand)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)

I would say this model is adequate for forecasting Demand or explaining the relationship between Temperature and Demand.
forecast <- forecast(lm20,
newdata = data.frame(Temperature = c(15,35)))
knitr::kable(forecast) %>%
kableExtra::kable_styling(bootstrap_options = "striped",
full_width = F,
position = 'float_right')

|  | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|
| 4.285714 | 140.5701 | 108.6810 | 172.4591 | 90.21166 | 190.9285 |
| 4.428571 | 275.7146 | 245.2278 | 306.2014 | 227.57056 | 323.8586 |
The first row is the forecast for 15 degrees Celsius, the second for 35 degrees Celsius. I would say the prediction for 35 degrees is fairly accurate because there is a recorded temperature within the first 20 days close to 35 Celsius (we have a data point at 34 degrees). The forecast for 15 degrees might be slightly less so, since the closest temperature we have to it is 19.6 degrees Celsius.
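The point forecasts in the table can be reproduced by hand from the regression coefficients above: they are just the fitted line evaluated at the new temperatures (small differences are due to rounding).

```r
# Rounded coefficients from the 20-day regression table
b0 <- 39.21  # intercept
b1 <- 6.76   # slope on Temperature

pred15 <- b0 + b1 * 15
pred35 <- b0 + b1 * 35

pred15  # ~140.6, matching the 140.57 point forecast
pred35  # ~275.8, matching the 275.71 point forecast
```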
The entire plot can be found under part a) Data exploration, and as is suggested by the plot and discussion there, this relationship is not as strong or useful for predicting electricity usage over the entirety of the year.
data("mens400")
autoplot(mens400) +
labs(title = 'Winning Olympic Times over the Years',
x = 'Year',
y = 'Winning Time')

Winning times (in seconds) for the men’s 400 meters final in each Olympic Games
Two things stand out from the graph of the winning times. First, there are some missing values because the Olympic Games did not take place during WW1 or WW2. Second, the winning time has been decreasing over the years, which means the athletes are getting better; however, this trend has been leveling off since the late 1990s.
# To calculate the average rate of decrease, we will make a dataframe object
mens.df <- as.data.frame(mens400)
# I just wanted to have a proper dependent variable name
mens.df$win <- mens.df$x
# We could use a years variable, but I think it's better to use a trend variable for better interpretability of the regression results
mens.df$t <- seq(0, (2016-1896), by = 4)
lm_mens <- tslm(win ~ t, data = mens.df)
summ_mens <- summ(lm_mens, model.info=T)
export_summs(summ_mens, output='html',
statistics = c("N" = "nobs",
"R squared" = "r.squared",
"P value" = "p.value",
"AIC" = "AIC"),
model.names = c('Win Time'),
coefs = c("Trend" = 't'))

|  | Win Time |
|---|---|
| Trend | -0.06 *** |
| (0.01) | |
| N | 31 |
| R squared | 0.82 |
| P value | 0.00 |
| AIC | |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
# we can use the regular year variable for the actual plotting
autoplot(mens400) +
labs(title = 'Winning Olympic Times over the Years',
x = 'Year',
y = 'Winning Time') +
geom_smooth(method = 'lm', se = F)

We can see that, on average, the winning time has decreased by about 0.06 seconds per year, or roughly a quarter of a second per Olympiad.
##
## Breusch-Godfrey test for serial correlation of order up to 6
##
## data: Residuals from Linear regression model
## LM test = 3.6082, df = 6, p-value = 0.7295
Besides a couple of outliers, the Breusch-Godfrey test gives no evidence of autocorrelation, so we can say that the simple regression generally fits the data well.
forecast <- forecast(lm_mens,
newdata = data.frame(t = 124))
knitr::kable(forecast) %>%
kableExtra::kable_styling(bootstrap_options = "striped",
full_width = F,
position = 'float_right')

|  | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|
| 2020 | 42.04231 | 40.44975 | 43.63487 | 39.55286 | 44.53176 |
The point prediction for the winning time of the men’s 400 at the 2020 Olympics is 42 seconds, but it could plausibly be as low as about 39.6 seconds or as high as 44.5 seconds. This forecast, using a linear model, assumes the residuals are normally distributed, which isn’t quite the case, as can be seen from [c) Residuals of our simple regression].
The ausbeer data contains total quarterly beer production in Australia (in megalitres) from 1956:Q1 to 2010:Q2.
Easter() returns a vector of 0’s and 1’s, or fractional values when Easter spans March and April in the observed time period. Easter is defined as the days from Good Friday through Easter Sunday inclusive, plus optionally Easter Monday if easter.mon = TRUE.
\[ 1: \ln(y) = \beta_0 + \beta_1\ln(x) + \epsilon \\ 2: \beta_1\cdot\frac{1}{x} = \frac{dy}{dx}\cdot\frac{1}{y} \\ 3: \beta_1 = \frac{dy}{dx}\cdot\frac{x}{y} \]
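We can sanity-check this elasticity result numerically: if we simulate data from y = x^β1 with multiplicative noise (a toy example, not any of the series used here), a log-log regression should recover β1 as its slope.

```r
set.seed(1)
x <- runif(200, 1, 10)
beta1 <- 0.5                               # true elasticity
y <- x^beta1 * exp(rnorm(200, sd = 0.05))  # multiplicative noise

fit <- lm(log(y) ~ log(x))
coef(fit)[2]  # slope estimate, close to the true elasticity of 0.5
```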
We can see from the above graph that the sales are increasing over time and they are highly seasonal, as is expected from the problem description - sales spike in March.
The seasonal variation in the data pushes the range upwards; here is a histogram of the sales.
A logarithm transform makes the data more normal, so we can use a linear model; the transformed histogram is shown below.
Below are the results of using the log-transformed fancy series with a seasonal fit.
fancy.ln <- log(fancy)
# Surfing festival dummy
fest <- rep(0, length(fancy))
fest[seq_along(fest)%%12 == 3] <- 1
# The festival started in 1988; our data starts in 1987, so remove the March 1987 dummy
fest[3] <- 0
# Save it as a timeseries
fest <- ts(fest, start = c(1987,1), freq = 12)
fancy.df <- data.frame(fancy.ln, fest)
lm.ln <- tslm(fancy.ln ~ trend + season + fest, data = fancy.df)
export_summs(summ(lm.ln), output='html',
statistics = c("N" = "nobs",
"R squared" = "r.squared",
"P value" = "p.value",
"AIC" = "AIC"),
model.names = c('Log Fancy Sales'))

|  | Log Fancy Sales |
|---|---|
| (Intercept) | 7.62 *** |
| (0.07) | |
| trend | 0.02 *** |
| (0.00) | |
| season2 | 0.25 * |
| (0.10) | |
| season3 | 0.27 |
| (0.19) | |
| season4 | 0.38 *** |
| (0.10) | |
| season5 | 0.41 *** |
| (0.10) | |
| season6 | 0.45 *** |
| (0.10) | |
| season7 | 0.61 *** |
| (0.10) | |
| season8 | 0.59 *** |
| (0.10) | |
| season9 | 0.67 *** |
| (0.10) | |
| season10 | 0.75 *** |
| (0.10) | |
| season11 | 1.21 *** |
| (0.10) | |
| season12 | 1.96 *** |
| (0.10) | |
| fest | 0.50 * |
| (0.20) | |
| N | 84 |
| R squared | 0.96 |
| P value | 0.00 |
| AIC | -35.96 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
plot(lm.ln$residuals,
main = 'Residuals of Log Model for fancy sales',
ylab = 'Residuals',
type = 'p')

I don’t see anything particularly interesting with the residuals against time.
data.frame(resid = lm.ln$residuals, fit = lm.ln$fitted.values) %>%
ggplot(aes(x=fit, y=resid)) + geom_point() +
labs(title = 'Fitted Values vs Residuals for Log Fancy Regression',
y = 'Residuals',
x = 'Fitted Values')

There is nothing too out of the ordinary with the fitted values either.
boxplot(res ~ month, data = data.frame(res = lm.ln$residuals, month = rep(1:12, 7)),
main = 'Boxplot of Residuals by Month')

From the boxplots we can see that months 8 (August), 9 (September), and 10 (October) can have high variance, in addition to months 3 (March) and 5 (May).
Because the dependent variable is log transformed, the coefficients are interpreted as effects on log sales. Otherwise, we can look at the sign and magnitude of the effects. The coefficient on each season (month) dummy tells us how sales in that month compare with the January baseline; all of them are positive, which matches the graphs, with the largest effects in November and December. The trend and festival coefficients are also important in forecasting and in determining the pattern of sales.
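As a reference point for reading these coefficients, a coefficient b on a dummy in a log-linear model corresponds to an exact percentage change of 100·(exp(b) − 1) on the original scale. A quick check for the fest coefficient of 0.50:

```r
# Exact percentage effect of a dummy coefficient in a log-linear model
b_fest <- 0.50  # fest coefficient from the table above
pct_change <- 100 * (exp(b_fest) - 1)
round(pct_change, 1)  # about a 64.9% increase in sales during festival months
```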
##
## Durbin-Watson test
##
## data: lm.ln
## DW = 0.88889, p-value = 1.956e-07
## alternative hypothesis: true autocorrelation is not 0
The DW statistic is significant, which indicates that our model still suffers from autocorrelation; this violates the independent-errors assumption that is required to fit a linear model.
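For context, the Durbin-Watson statistic is DW = Σ(e_t − e_{t−1})² / Σe_t², which is approximately 2(1 − r1), where r1 is the lag-1 residual autocorrelation; values well below 2 signal positive autocorrelation. A base-R sketch on simulated AR(1) "residuals" (illustrative only, not the model above):

```r
set.seed(123)
# Simulate positively autocorrelated residuals (AR(1) with phi = 0.6)
e <- as.numeric(arima.sim(list(ar = 0.6), n = 200))

dw <- sum(diff(e)^2) / sum(e^2)  # Durbin-Watson statistic by hand
r1 <- cor(e[-1], e[-length(e)])  # lag-1 autocorrelation

dw            # well below 2, signalling positive autocorrelation
2 * (1 - r1)  # close to dw
```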
# new festival dates
new_fes = rep(0, 36)
new_fes[seq_along(new_fes)%%12 == 3] <- 1
forecast <- forecast(lm.ln, newdata=data.frame(fest = new_fes))
knitr::kable(forecast) %>%
kableExtra::kable_styling(bootstrap_options = "striped",
full_width = F)

|  | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|
| Jan 1994 | 9.491352 | 9.238522 | 9.744183 | 9.101594 | 9.88111 |
| Feb 1994 | 9.764789 | 9.511959 | 10.017620 | 9.375031 | 10.15455 |
| Mar 1994 | 10.302990 | 10.048860 | 10.557120 | 9.911228 | 10.69475 |
| Apr 1994 | 9.941465 | 9.688635 | 10.194296 | 9.551707 | 10.33122 |
| May 1994 | 9.988919 | 9.736088 | 10.241749 | 9.599161 | 10.37868 |
| Jun 1994 | 10.050280 | 9.797449 | 10.303110 | 9.660522 | 10.44004 |
| Jul 1994 | 10.233926 | 9.981096 | 10.486756 | 9.844168 | 10.62368 |
| Aug 1994 | 10.233456 | 9.980625 | 10.486286 | 9.843698 | 10.62321 |
| Sep 1994 | 10.336841 | 10.084010 | 10.589671 | 9.947083 | 10.72660 |
| Oct 1994 | 10.436923 | 10.184092 | 10.689753 | 10.047165 | 10.82668 |
| Nov 1994 | 10.918299 | 10.665468 | 11.171129 | 10.528541 | 11.30806 |
| Dec 1994 | 11.695812 | 11.442981 | 11.948642 | 11.306054 | 12.08557 |
| Jan 1995 | 9.755590 | 9.499844 | 10.011336 | 9.361338 | 10.14984 |
| Feb 1995 | 10.029027 | 9.773281 | 10.284773 | 9.634775 | 10.42328 |
| Mar 1995 | 10.567228 | 10.310517 | 10.823938 | 10.171489 | 10.96297 |
| Apr 1995 | 10.205703 | 9.949957 | 10.461449 | 9.811451 | 10.59996 |
| May 1995 | 10.253157 | 9.997411 | 10.508903 | 9.858904 | 10.64741 |
| Jun 1995 | 10.314518 | 10.058772 | 10.570264 | 9.920265 | 10.70877 |
| Jul 1995 | 10.498164 | 10.242418 | 10.753910 | 10.103911 | 10.89242 |
| Aug 1995 | 10.497694 | 10.241948 | 10.753440 | 10.103441 | 10.89195 |
| Sep 1995 | 10.601079 | 10.345333 | 10.856825 | 10.206826 | 10.99533 |
| Oct 1995 | 10.701161 | 10.445415 | 10.956907 | 10.306908 | 11.09541 |
| Nov 1995 | 11.182537 | 10.926791 | 11.438282 | 10.788284 | 11.57679 |
| Dec 1995 | 11.960050 | 11.704304 | 12.215796 | 11.565797 | 12.35430 |
| Jan 1996 | 10.019828 | 9.760563 | 10.279093 | 9.620151 | 10.41951 |
| Feb 1996 | 10.293265 | 10.034000 | 10.552530 | 9.893588 | 10.69294 |
| Mar 1996 | 10.831466 | 10.571566 | 11.091365 | 10.430810 | 11.23212 |
| Apr 1996 | 10.469941 | 10.210676 | 10.729206 | 10.070264 | 10.86962 |
| May 1996 | 10.517395 | 10.258130 | 10.776659 | 10.117717 | 10.91707 |
| Jun 1996 | 10.578756 | 10.319491 | 10.838021 | 10.179079 | 10.97843 |
| Jul 1996 | 10.762402 | 10.503137 | 11.021667 | 10.362725 | 11.16208 |
| Aug 1996 | 10.761931 | 10.502667 | 11.021196 | 10.362255 | 11.16161 |
| Sep 1996 | 10.865317 | 10.606052 | 11.124582 | 10.465640 | 11.26499 |
| Oct 1996 | 10.965399 | 10.706134 | 11.224664 | 10.565722 | 11.36508 |
| Nov 1996 | 11.446775 | 11.187510 | 11.706039 | 11.047097 | 11.84645 |
| Dec 1996 | 12.224288 | 11.965023 | 12.483553 | 11.824611 | 12.62396 |
forecast$mean <- exp(forecast$mean)
forecast$upper <- exp(forecast$upper)
forecast$lower <- exp(forecast$lower)
forecast$x <- exp(forecast$x)
knitr::kable(forecast) %>%
kableExtra::kable_styling(bootstrap_options = "striped",
full_width = F)

|  | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|
| Jan 1994 | 13244.70 | 10285.82 | 17054.73 | 8969.583 | 19557.43 |
| Feb 1994 | 17409.81 | 13520.45 | 22418.00 | 11790.284 | 25707.73 |
| Mar 1994 | 29821.65 | 23129.40 | 38450.24 | 20155.412 | 44123.68 |
| Apr 1994 | 20774.16 | 16133.21 | 26750.16 | 14068.696 | 30675.62 |
| May 1994 | 21783.73 | 16917.24 | 28050.15 | 14752.395 | 32166.37 |
| Jun 1994 | 23162.27 | 17987.81 | 29825.24 | 15685.969 | 34201.95 |
| Jul 1994 | 27831.56 | 21613.98 | 35837.72 | 18848.111 | 41096.73 |
| Aug 1994 | 27818.48 | 21603.82 | 35820.87 | 18839.249 | 41077.41 |
| Sep 1994 | 30848.42 | 23956.87 | 39722.43 | 20891.193 | 45551.50 |
| Oct 1994 | 34095.57 | 26478.61 | 43903.67 | 23090.230 | 50346.32 |
| Nov 1994 | 55176.84 | 42850.31 | 71049.28 | 37366.903 | 81475.41 |
| Dec 1994 | 120067.79 | 93244.59 | 154607.08 | 81312.400 | 177294.90 |
| Jan 1995 | 17250.40 | 13357.65 | 22277.59 | 11629.938 | 25587.08 |
| Feb 1995 | 22675.20 | 17558.28 | 29283.31 | 15287.252 | 33633.55 |
| Mar 1995 | 38840.85 | 30046.98 | 50208.44 | 26146.972 | 57697.39 |
| Apr 1995 | 27057.06 | 20951.33 | 34942.16 | 18241.435 | 40133.06 |
| May 1995 | 28371.96 | 21969.51 | 36640.25 | 19127.918 | 42083.42 |
| Jun 1995 | 30167.42 | 23359.80 | 38958.95 | 20338.387 | 44746.58 |
| Jul 1995 | 36248.88 | 28068.91 | 46812.70 | 24438.412 | 53767.06 |
| Aug 1995 | 36231.84 | 28055.72 | 46790.69 | 24426.922 | 53741.78 |
| Sep 1995 | 40178.16 | 31111.50 | 51887.06 | 27087.467 | 59595.26 |
| Oct 1995 | 44407.37 | 34386.35 | 57348.77 | 29938.733 | 65868.34 |
| Nov 1995 | 71864.42 | 55647.40 | 92807.48 | 48449.831 | 106594.69 |
| Dec 1995 | 156380.86 | 121091.75 | 201954.08 | 105429.448 | 231955.81 |
| Jan 1996 | 22467.57 | 17336.40 | 29117.46 | 15065.329 | 33506.86 |
| Feb 1996 | 29533.04 | 22788.25 | 38274.14 | 19802.984 | 44043.89 |
| Mar 1996 | 50587.81 | 39009.73 | 65602.25 | 33887.802 | 75517.62 |
| Apr 1996 | 35240.15 | 27191.96 | 45670.42 | 23629.808 | 52555.15 |
| May 1996 | 36952.72 | 28513.41 | 47889.88 | 24778.151 | 55109.18 |
| Jun 1996 | 39291.20 | 30317.82 | 50920.48 | 26346.183 | 58596.65 |
| Jul 1996 | 47211.93 | 36429.60 | 61185.57 | 31657.322 | 70409.18 |
| Aug 1996 | 47189.73 | 36412.48 | 61156.80 | 31642.439 | 70376.07 |
| Sep 1996 | 52329.57 | 40378.47 | 67817.91 | 35088.887 | 78041.33 |
| Oct 1996 | 57837.85 | 44628.77 | 74956.52 | 38782.394 | 86256.08 |
| Nov 1996 | 93598.96 | 72222.70 | 121302.09 | 62761.521 | 139588.15 |
| Dec 1996 | 203676.38 | 157160.50 | 263959.89 | 136572.460 | 303751.35 |
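One caveat about back-transforming with exp(): exponentiating the point forecast of a log model gives the median on the original scale, not the mean. If the log-scale errors are roughly normal with variance σ², the bias-adjusted mean multiplies by exp(σ²/2). A quick simulation (toy numbers, not the fancy data) showing the lognormal mean:

```r
set.seed(7)
mu <- 10; sigma <- 0.25          # log-scale mean and sd (toy values)
z <- exp(rnorm(1e5, mu, sigma))  # lognormal sample

mean(z)                # close to exp(mu + sigma^2 / 2)
exp(mu)                # the naive back-transform (median), a bit smaller
exp(mu + sigma^2 / 2)  # bias-adjusted mean
```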
I think we could have used a seasonal multiplicative model to forecast the future sales, since there is clearly an increasing seasonal trend.
Let's first plot the entire series.
Gasoline measured in millions of gallons a day
Let's first try Fourier terms with K = 1 and K = 5.
# We can only use up to 2004 for our training data
gas.2004 <- window(gasoline, end = 2005)
# First order
fit.1 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K = 1))
# Fifth order
fit.5 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K = 5))
autoplot(gas.2004) +
autolayer(fit.1$fitted.values,
series = 'K = 1',
size = 2) +
autolayer(fit.5$fitted.values,
series = 'K = 5',
size = 1.2) +
ggtitle('Fourier Transform K=1 and K=5') +
ylab("Gasoline")

We can see that the K = 5 harmonic tracks the data better than the K = 1 harmonic, which is more rigid since it contains only a single sine/cosine pair. Now let's see what improvement K = 10, 15, and 20 add to our model.
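For intuition, the fourier() regressors are just sine/cosine pairs at harmonic frequencies: harmonic k for a series with period m contributes sin(2πkt/m) and cos(2πkt/m). A simplified base-R construction (illustrative only; the actual fourier() function also handles ts objects and forecast horizons):

```r
# Build K sine/cosine harmonic pairs for a series of length n with period m
fourier_terms <- function(n, m, K) {
  t <- seq_len(n)
  cols <- lapply(seq_len(K), function(k) {
    cbind(sin(2 * pi * k * t / m), cos(2 * pi * k * t / m))
  })
  X <- do.call(cbind, cols)
  colnames(X) <- paste0(rep(c("S", "C"), K), rep(seq_len(K), each = 2))
  X
}

# Two years of weekly data (period 52) with K = 5 harmonics
X <- fourier_terms(n = 104, m = 52, K = 5)
dim(X)  # 104 rows, 2 * K = 10 columns
```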
fit.10 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K = 10))
fit.15 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K = 15))
fit.20 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K = 20))
autoplot(gas.2004) +
autolayer(fit.5$fitted.values,
series = 'K = 5',
size = 2.5) +
autolayer(fit.10$fitted.values,
series = 'K = 10',
size = 2) +
autolayer(fit.15$fitted.values,
series = 'K = 15',
size = 1) +
autolayer(fit.20$fitted.values,
series = 'K = 20',
size = 0.75) +
ggtitle('Fourier Transforms K=5 to K = 20') +
ylab("Gasoline")

We can see from the overlays that each successive K produces a better fit; the amplitude of the fitted series shrinks somewhat, but the variance in the data is captured better.
cvs <- as.data.frame(rbind(CV(fit.1), CV(fit.5), CV(fit.10),
CV(fit.15), CV(fit.20)))
rownames(cvs) <- c('K = 1', 'K = 5', 'K = 10',
'K = 15', 'K = 20')
cvs %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = "striped",
full_width = F)

|  | CV | AIC | AICc | BIC | AdjR2 |
|---|---|---|---|---|---|
| K = 1 | 0.0820192 | -1813.292 | -1813.208 | -1790.354 | 0.8250858 |
| K = 5 | 0.0755365 | -1873.234 | -1872.723 | -1813.596 | 0.8406928 |
| K = 10 | 0.0713575 | -1915.014 | -1913.441 | -1809.500 | 0.8516102 |
| K = 15 | 0.0719086 | -1910.231 | -1906.988 | -1758.841 | 0.8525942 |
| K = 20 | 0.0722383 | -1908.017 | -1902.469 | -1710.753 | 0.8540588 |
From these we can see that we should choose the K = 10 model; the K values above 10 overfit the training data.
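The CV column here is the leave-one-out cross-validation error, which for a linear model has a closed form using the leverage (hat) values: CV = mean((e_i/(1 − h_i))²). A self-contained check on simulated data that this shortcut matches an explicit leave-one-out loop:

```r
set.seed(42)
n <- 60
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Shortcut: scale each residual by (1 - its leverage)
cv_shortcut <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

# Explicit leave-one-out loop for comparison
loo_err <- sapply(seq_len(n), function(i) {
  f <- lm(y ~ x, subset = -i)
  y[i] - predict(f, newdata = data.frame(x = x[i]))
})
cv_loop <- mean(loo_err^2)

all.equal(cv_shortcut, cv_loop)  # TRUE
```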
### c) Residuals
Below are the residual graphs.
##
## Breusch-Godfrey test for serial correlation of order up to 104
##
## data: Residuals from Linear regression model
## LM test = 155.45, df = 104, p-value = 0.0008135
forecast <- forecast(fit.10,
newdata=data.frame(fourier(gas.2004, K = 10, h = 52)))
autoplot(window(gasoline, start = 2004, end = 2006))+
autolayer(forecast) +
labs(title = 'Forecasts of 2006',
y = 'Gasoline')

We can see that our forecast interval does a pretty good job of covering the next year’s actual values. Below is a graph without the interval.
autoplot(window(gasoline, start = 2004, end = 2006))+
autolayer(forecast$mean, color = 'blue') +
labs(title = 'Mean Gasoline Forecasts',
y = 'Gasoline')

When we remove the interval, we see that the model does not predict the large drop in mid-2005; this might be a seasonal event or an unforeseen one. But the drop is not too surprising, since it falls within the 95% interval.
# The fpp2 huron series contains the Lake Huron water levels, 1875-1972
autoplot(huron) +
labs(title = 'Water Level of Lake Huron over Time',
y = 'Water Level')

Water level in feet
We don’t see a very obvious trend in the data; there might be some seasonality or cyclicality, but it’s hard to tell, and the long-term average level of the lake appears rather constant.
# Regular linear model
lm.water <- tslm(huron ~ trend)
export_summs(summ(lm.water), output='html',
statistics = c("N" = "nobs",
"R squared" = "r.squared",
"P value" = "p.value",
"AIC" = "AIC"),
model.names = c('Water Level'))

|  | Water Level |
|---|---|
| (Intercept) | 10.20 *** |
| (0.23) | |
| trend | -0.02 *** |
| (0.00) | |
| N | 98 |
| R squared | 0.27 |
| P value | 0.00 |
| AIC | 306.10 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
# Piecewise
t <- time(huron)
t.break <- 1915
t.piece <- ts(pmax(0,t-t.break), start=1875)
pw.huron <- tslm(huron ~ t + t.piece)
export_summs(summ(pw.huron), output='html',
statistics = c("N" = "nobs",
"R squared" = "r.squared",
"P value" = "p.value",
"AIC" = "AIC"),
model.names = c('Water Level'))

|  | Water Level |
|---|---|
| (Intercept) | 132.91 *** |
| (19.98) | |
| t | -0.06 *** |
| (0.01) | |
| t.piece | 0.06 *** |
| (0.02) | |
| N | 98 |
| R squared | 0.38 |
| P value | 0.00 |
| AIC | 291.77 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
We can see that, due to the lower AIC, the piecewise function appears to be a better fit for the data than the linear model.
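The pmax(0, t - t.break) term used above is a "hinge" regressor: zero before the knot and linear after it, so the slope before the break is the coefficient on t, and the slope after is the sum of the two slope coefficients (here roughly −0.06 + 0.06 ≈ 0). A toy base-R example with a known slope change (hypothetical data, not the huron series):

```r
set.seed(1)
tt <- 1:100
knot <- 50
# True slopes: -1 before the knot, -1 + 1.5 = 0.5 after it
y2 <- -1 * tt + 1.5 * pmax(0, tt - knot) + rnorm(100, sd = 2)

fit.pw <- lm(y2 ~ tt + pmax(0, tt - knot))
b <- coef(fit.pw)
unname(b[2])         # slope before the knot, close to -1
unname(b[2] + b[3])  # slope after the knot, close to 0.5
```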
# 1980 is 8 years forward
# First lets fit the linear trend
forecast <- forecast(lm.water, h=8)
# Then the piecewise
t.new <- t[length(t)] + seq(8)
t.piece.new <- t.piece[length(t.piece)]+seq(8)
forecast.pw <- forecast(pw.huron,newdata = data.frame('t' = t.new, 't.piece' = t.piece.new))
# Plotting the linear
autoplot(huron) +
autolayer(forecast) +
autolayer(lm.water$fitted.values, color = 'blue') +
labs(title = 'Water Level of Lake Huron over Time',
y = 'Water Level')

# Plotting the piecewise
autoplot(huron) +
autolayer(forecast.pw) +
autolayer(pw.huron$fitted.values, color = 'blue') +
labs(title = 'Water Level of Lake Huron over Time',
y = 'Water Level')

We can see that the piecewise model captures the initial drop in the series, whereas the linear model fails to do so.
Centered moving averages can be smoothed by another moving average. This creates a double moving average. In the case of a 3x5 moving average, this signifies a 3 period moving average of a 5 period moving average.
This means we take three consecutive 5-period averages, each weighted 1/3. These three parts simplify as shown below:
\[ \frac{1}{3}\left(\frac{X_1+X_2+X_3+X_4+X_5}{5}\right) + \frac{1}{3}\left(\frac{X_2+X_3+X_4+X_5+X_6}{5}\right) + \frac{1}{3}\left(\frac{X_3+X_4+X_5+X_6+X_7}{5}\right) \\ = \frac{1}{15}X_1 + \frac{2}{15}X_2 + \frac{3}{15}X_3 + \frac{3}{15}X_4 + \frac{3}{15}X_5 + \frac{2}{15}X_6 + \frac{1}{15}X_7 \]
This is equivalent to the stated weights for the 7-period MA.
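We can also verify these weights numerically: a 3-period MA of a 5-period MA should equal a single 7-period weighted MA with weights (1, 2, 3, 3, 3, 2, 1)/15. A base-R check using stats::filter():

```r
# Centered k-period moving average via a convolution filter
ma <- function(x, k) stats::filter(x, rep(1 / k, k), sides = 2)

set.seed(9)
x <- rnorm(30)

double_ma <- ma(ma(x, 5), 3)  # 3x5 moving average
w <- c(1, 2, 3, 3, 3, 2, 1) / 15
weighted_ma <- stats::filter(x, w, sides = 2)

all.equal(as.numeric(double_ma), as.numeric(weighted_ma))  # TRUE
```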
Units are in thousands
We can see a seasonal fluctuation with an upward trend in the monthly sales. From eyeballing the plot, the seasonal fluctuations grow with the level of the series, so I would say the seasonality is multiplicative.
plastics %>% decompose(type="multiplicative") %>%
autoplot() +
labs(x = "Month") +
ggtitle("Monthly Product Sales Decomposition")

We can see from the decomposition that there is indeed a seasonal pattern in the data, in addition to a general upward trend.
autoplot(plastics) +
labs(title = "Monthly Sale Predictions",
y="Plastic Sales", x = 'Month') +
autolayer((plastics/decompose(plastics, type="multiplicative")$seasonal),
series = 'Decomposed')

# For reproducibility
set.seed(42)
outlier <- sample(1:length(plastics), 1)
plastics2 <- plastics
plastics2[outlier] <- plastics[outlier] + 500
autoplot(plastics2) +
labs(title = "Monthly Sales, Observation 19 Outlier",
y="Plastic Sales", x = 'Month') +
autolayer((plastics2/decompose(plastics2, type="multiplicative")$seasonal),
series = 'Decomposed')

We can see that the decomposition picked up on the outlier, which passes through into the seasonally adjusted series.
# Outlier at the second-to-last observation
plastics3 <- plastics
plastics3[(length(plastics)-1)] <- plastics[(length(plastics)-1)] + 500
autoplot(plastics3) +
labs(title = "Monthly Sales, End Series Outlier",
y="Plastic Sales", x = 'Month') +
autolayer((plastics3/decompose(plastics3, type="multiplicative")$seasonal),
series = 'Decomposed')

We can see that an outlier near the end is also handled pretty well by the decomposition.
temp = tempfile(fileext = ".xlsx")
url <- 'https://otexts.com/fpp2/extrafiles/retail.xlsx'
download.file(url, destfile=temp, mode='wb')
retail <- readxl::read_excel(temp, skip = 1)
retail.ts <- ts(retail[,17], frequency=12, start=c(1982,4))
autoplot(retail.ts, main="Monthly Australian Retail Sales", ylab="Sales", xlab="Year")

library(seasonal)
retail.x11 <- seas(retail.ts, x11="")
autoplot(retail.x11,
main="Monthly Australian Retail Sales X11 Decomposition",
xlab="Year")

The last series in the decomposition graph reveals outliers around 1987, 1993/4, 2003, and several more in the 2010s. We were previously unable to see these in the original graphs.
The graph shows the number of persons in the civilian labor force in Australia each month from February 1978 to August 1995. There is a strong positive trend. The decomposition shows that the growth in the labor force is cyclical with seasonality. Looking at the last series, there are some major outliers around 1992/3; this could be a one-time event in Australia that affected the labor force, perhaps something like a census accounting change. The second chart measures the degree of seasonality and shows a large increase in July and a decrease in March.
The 1991/2 recession is visible in the irregular series at the end: we can see major decreases in the labor force in the months of those years.
data(cangas)
autoplot(cangas,
main = 'Canadian Gas Production',
ylab = 'Gas Production in Billions of Cubic Metres',
xlab = 'Year')

ggsubseriesplot(cangas,
main = 'Canadian Gas Production Sub Series',
ylab = 'Gas Production in Billions of Cubic Metres',
xlab = 'Month')

ggseasonplot(cangas,
main = 'Canadian Gas Production Seasonal Plot',
ylab = 'Gas Production in Billions of Cubic Metres',
xlab = 'Month')

I think the change in seasonality over the years, namely that it becomes less pronounced, might be due to two things. One is technological change that allows steadier gas production throughout the year. The other is probably new gas reserves being found over time.
stl(cangas,
t.window=13,
s.window="periodic",
robust=TRUE) %>%
autoplot(main = 'STL Decomposition', xlab = 'Years')

The results of the decompositions are fairly similar. The remainder plot shows all the outliers between 1980 and 1990 that might have been affecting the previous breakdowns. These outliers are smaller later on, which matches the decrease in seasonality discussed in previous sections.
autoplot(bricksq,
main="Clay Production", ylab="Units", xlab="Year") +
autolayer((bricksq-decompose(bricksq, type="additive")$seasonal),
series = 'Decomposed')

autoplot(bricksq,
main="Clay Production", ylab="Units", xlab="Year") +
autolayer(naive((bricksq-decompose(bricksq, type="additive")$seasonal),
h=30), series="Naïve")

##
## Ljung-Box test
##
## data: Residuals from STL + ETS(M,N,N)
## Q* = 41.128, df = 6, p-value = 2.733e-07
##
## Model df: 2. Total lags used: 8
It looks like the residuals are for the most part normally distributed, although the Ljung-Box test indicates significant remaining autocorrelation.
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(M,N,N)
## Q* = 28.163, df = 6, p-value = 8.755e-05
##
## Model df: 2. Total lags used: 8
It doesn’t seem to have made much of a difference: the robust version’s residuals still show significant autocorrelation.
Brick.train <- subset(bricksq, end=length(bricksq) - 9)
Brick.test <- subset(bricksq, start=length(bricksq) - 8)
brick <- snaive(Brick.train)
brick.stlf <- stlf(Brick.train, robust=TRUE)
autoplot(brick)

It looks like the forecasts with STLF are less variable and better predict production.